
Introduction to Web Scraping


1. What is Web Scraping?


Web scraping is the automated collection of data from websites by a program, instead of copying it by hand line by line. A few lines of code can gather hundreds or thousands of records in minutes.

2. Common Types of Websites

Websites are often classified by several criteria:

  • By dynamism: Static vs Dynamic websites
  • By frontend/backend technologies: React, Vue, Django, Laravel, etc.
  • By code architecture: Monolith, Microservices, etc.
  • By rendering technologies: SSR, CSR, hybrid

In this introduction we focus only on the classification by dynamism.

3. Static & Dynamic Websites


Static Websites:

  • Use only HTML and CSS; content is “fixed” — doesn’t change per visitor.
  • Easy to scrape since content is already in the HTML.
  • Examples: Portfolio sites, product landing pages.

Dynamic Websites:

  • Use backend processing — often PHP, Node.js, Python, etc.
  • Content changes based on user interaction or is loaded by JavaScript.
  • Harder to scrape: content may appear only after JavaScript runs, so you often have to wait for the page to finish loading (or drive a headless browser).
  • Examples: Shopee, Facebook, real-time price tracking sites.

4. What is the DOM Structure?

The DOM (Document Object Model) is the tree structure of a web page: each HTML tag is a node, and a node can be the parent or child of other nodes.

Simple example:

<body>
  <h1>Title</h1>
  <p>Description here</p>
</body>

In this example:

  • <body> is the parent

  • <h1> and <p> are children
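
These parent/child relationships can be inspected programmatically. A minimal sketch using BeautifulSoup (assumes the beautifulsoup4 package is installed):

```python
from bs4 import BeautifulSoup

html = """
<body>
  <h1>Title</h1>
  <p>Description here</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
print(h1.parent.name)  # "body" -- <body> is the parent of <h1>

# Direct children of <body> (recursive=False skips grandchildren)
children = [tag.name for tag in soup.body.find_all(recursive=False)]
print(children)  # ['h1', 'p']
```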

Larger DOM example:

<html>
  <head>
    <title>Page A</title>
  </head>
  <body>
    <div class="header">
      <h1>Welcome</h1>
    </div>
    <div class="content">
      <ul>
        <li>Book 1</li>
        <li>Book 2</li>
      </ul>
    </div>
    <footer>Contact</footer>
  </body>
</html>

  • <html> is the root node containing the entire webpage.

  • <head> contains page information like the title, not directly visible.

  • <title> is the page title shown on the browser tab.

  • <body> holds the main content visible to users.

  • Inside <body>, there are smaller parts called child nodes:

    • <div class="header"> contains the main header <h1>Welcome</h1>.

    • <div class="content"> contains a list of books with <li> items.

    • <footer> is the footer section with the text “Contact”.

This structure is like a tree: each tag is a branch or a leaf, which makes it easy to locate and extract data when scraping a website.
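
The tree above can be queried with CSS selectors. A short sketch with BeautifulSoup (assuming beautifulsoup4 is installed) that pulls the book titles out of the .content block:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Page A</title></head>
  <body>
    <div class="header"><h1>Welcome</h1></div>
    <div class="content">
      <ul>
        <li>Book 1</li>
        <li>Book 2</li>
      </ul>
    </div>
    <footer>Contact</footer>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector: every <li> inside an element with class "content"
books = [li.text for li in soup.select(".content li")]
print(books)  # ['Book 1', 'Book 2']
```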

5. Goals and Applications of Web Scraping


🎯 Main goals:

  • Automate data collection (fast, save effort)
  • Analyze and compare prices (products, crypto, flight tickets, etc.)
  • Track content changes (news, prices, rankings, etc.)
  • Create datasets for research, machine learning, statistics
  • Integrate into internal systems like dashboards or apps

6. Popular Python Libraries for Web Scraping

  • requests – Send HTTP requests, fetch HTML content
  • BeautifulSoup (bs4) – Easy HTML parsing and extraction
  • lxml – Fast and powerful parser for HTML/XML
  • selenium – Automate interaction with dynamic (JS) sites
  • scrapy – Framework for large crawling projects
  • httpx – Similar to requests but supports async
  • pyppeteer, playwright – Headless browser control, good for JS-heavy sites

🛠 Choose libraries based on your goals. For static sites, requests + BeautifulSoup is usually enough.


7. Real example: Scraping books from books.toscrape.com

The site books.toscrape.com is a sample site for practicing web scraping.

  • It is a static website, ideal for beginners
  • Contains 1000 books spread across 50 pages
  • Simple URL structure:
https://books.toscrape.com/catalogue/page-{page_number}.html
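
Since the pattern only varies by page number, the full list of page URLs can be generated with a format string:

```python
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

# 50 catalogue pages, numbered 1..50
urls = [BASE_URL.format(page) for page in range(1, 51)]
print(urls[0])    # https://books.toscrape.com/catalogue/page-1.html
print(len(urls))  # 50
```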

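
The lines explained below can be sketched as a small script (assumes requests and beautifulsoup4 are installed; the title attribute and the .price_color selector are taken from the site's markup and are assumptions beyond what is shown here):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Download one catalogue page and return its HTML."""
    res = requests.get(url)  # send an HTTP request for the page
    res.raise_for_status()   # fail loudly on 4xx/5xx responses
    return res.text

def parse_books(html):
    """Extract (title, price) pairs from a catalogue page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Each book sits in a tag carrying these four grid classes
    books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")
    results = []
    for book in books:
        title = book.h3.a["title"]                    # full title lives in the link's title attribute
        price = book.select_one(".price_color").text  # e.g. "£51.77"
        results.append((title, price))
    return results
```

Putting it together: parse_books(fetch_page("https://books.toscrape.com/catalogue/page-1.html")) returns a list of (title, price) tuples for the books on the first page.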

  • res = requests.get(url): Sends an HTTP request to get the webpage content at the given URL.

  • soup = BeautifulSoup(res.text, 'html.parser'): Parses the HTML content of the page using BeautifulSoup for easier processing.

  • books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3"): Selects all HTML elements with the class "col-xs-6 col-sm-4 col-md-3 col-lg-3" — these are the tags containing information for each book on the page.

Each element in books is a “node” containing detailed information about a book, making it easy to extract details like title, image, rating, price, etc.

This post is licensed under CC BY 4.0 by the author.